Towards Automatic Annotation of Text Type Structure: Experiments Using an XML-annotated Corpus and Automatic Text Classification Methods

Authors

  • Hagen Langer
  • Harald Lüngen
  • Petra Saskia Bayerl
Abstract

Scientific articles exhibit a fairly conventionalized structure in terms of topic types such as background, researchTopic, method and their ordering and rhetorical interrelations. This paper describes an effort to make such structures explicit by providing a corpus of German linguistic articles with XML markup according to a text type schema defining 21 topic type categories. The corpus is further augmented with XML annotations on a grammatical level and a logical structure level. The efficiency of an automatic annotation of text type structure is explored in experiments that apply general, domain-independent automatic text classification methods to text segments and employ features from the raw text level and the corpus annotations on the grammatical level. The results indicate that some of our topic types are successfully learnable.

Related resources

Subtopic annotation and automatic segmentation for news texts in Brazilian Portuguese

Subtopic segmentation aims to break documents into subtopical text passages, which develop a main topic in a text. Being capable of automatically detecting subtopics is very useful for several Natural Language Processing applications. For instance, in automatic summarisation, having the subtopics at hand enables the production of summaries with good subtopic coverage. Given the usefulness of su...

Text Type Structure And Logical Document Structure

Most research on automated categorization of documents has concentrated on the assignment of one or many categories to a whole text. However, new applications, e.g. in the area of the Semantic Web, require a richer and more fine-grained annotation of documents, such as detailed thematic information about the parts of a document. Hence we investigate the automatic categorization of text segments...

Annotation and Classification of Argumentative Writing Revisions

This paper explores the annotation and classification of students’ revision behaviors in argumentative writing. A sentence-level revision schema is proposed to capture why and how students make revisions. Based on the proposed schema, a small corpus of student essays and revisions was annotated. Studies show that manual annotation is reliable with the schema and the annotated information helpfu...

Perceptually Motivated Parameters for Automatic Prosodic Annotation

This contribution presents an approach to automatic prosodic annotation which emphasizes the linguistic motivation and perceptual relevance of the features used for classifying the prosodic categories. The analyses and experiments presented here were conducted on a 2.5 hours German news-like corpus which had been manually annotated using GToBI(S) (Mayer, 1995). GToBI(S) is an adaptation of Amer...

Building a Discourse-Annotated Dutch Text Corpus

We are compiling a corpus of Dutch texts annotated with discourse structure and lexical cohesion, containing initially 80 texts from expository and persuasive genres. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We are also exploring the possibilities of automatic text segmentation and semi-automatic discourse an...

Publication date: 2004